This is the notebook for the Module 10 Programming Assignment.
Here are a few tips for using the IPython HTML notebook:
You should rename this notebook to be <your JHED id>_PR12.ipynb. Do it right now.
Make certain the entire notebook executes before you submit it. (See Hint #5 above)
Change the following variables:
In [1]:
name = "Shehzad Qureshi"
jhed_id = "squresh6"
if name == "Student Name" or jhed_id == "sname1":
    raise Exception("Change the name and/or JHED ID...preferably to yours.")
Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The import gives you access to these functions: http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html which, among other things, will permit you to display HTML as the result of evaluated code (see HTML() or display_html()).
In [2]:
from IPython.core.display import *
from StringIO import StringIO
import numpy as np
from pprint import pprint
In this assignment you will be using the mushroom data from the Decision Tree module:
http://archive.ics.uci.edu/ml/datasets/Mushroom
The assignment is to program a Naive Bayes Classifier for this data. You'll first need to calculate all of the necessary probabilities (don't forget to use +1 smoothing). You'll then need to have a classify() function that takes an unlabeled instance and returns the normalized probability distribution over the possible classes.
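For concreteness, a classified instance might come back as ('e', {'e': 0.98, 'p': 0.02}): the most probable label together with the normalized distribution over both labels (the numbers here are made up purely for illustration).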
Which classifier do you prefer for this problem: the decision tree or the Naive Bayes classifier? Why?
In [3]:
data = np.loadtxt('agaricus-lepiota.data', delimiter=',', dtype='S1')
We need to be able to determine all domain values of a particular variable.
In [4]:
def get_domain_values(data, col):
d = np.unique(data[:, col])
return d
In [5]:
print np.array_equal(get_domain_values(data, 0), ['e', 'p'])
Now we need to count how many times a value occurs for a particular variable. We also give the option to restrict the count to rows with a particular class label.
In [6]:
def get_domain_value_count(data, col, value, ycolumn=None, yvalue=None):
vals = (data[:, col] == value)
count = vals.sum()
if ycolumn is None or yvalue is None:
return count
yvals = (data[:, ycolumn] == yvalue)
count = np.logical_and(vals, yvals).sum()
return count
In [7]:
print get_domain_value_count(data, 0, 'e') == 4208
print get_domain_value_count(data, 0, 'p') == 3916
print get_domain_value_count(data, 1, 'x', 0, 'e') == 1948
This is a method for counting the number of entries in a particular column.
In [8]:
def get_column_count(data, col):
vals = data[:,col]
count = len(vals)
return count
In [9]:
get_column_count(data, 1) == 8124
Out[9]:
True
Now we need a method to calculate our probabilities. We will use the following formulas:
$P(label = e) = \frac {\# label = e}{\# labels}$
and
$P(attrib = x | label = e) = \frac{(\#\, attrib = x ~and~ label = e) + 1}{(\#\, label = e) + 1}$
We add 1 to both the numerator and the denominator of the latter formula (+1 smoothing) so that no conditional probability is ever zero, since a single zero would wipe out the entire product.
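As a worked example, using the counts verified above (4208 instances with label e, and 1948 instances where attribute 1 is x and the label is e):
$P(attrib_1 = x | label = e) = \frac{1948 + 1}{4208 + 1} = \frac{1949}{4209} \approx 0.463$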
In [10]:
def calc_probability(data, col, value, ycolumn=None, yvalue=None):
num_vals = get_domain_value_count(data, col, value)
num_total = get_column_count(data, col)
val_prob = float(num_vals) / (num_total)
if ycolumn is None or yvalue is None:
return val_prob
num_vals = get_domain_value_count(data, col, value, ycolumn, yvalue)
num_yvals = get_domain_value_count(data, ycolumn, yvalue)
val_prob = float(num_vals + 1) / float(num_yvals + 1)
return val_prob
In [11]:
print calc_probability(data, 0, 'e') == 4208/8124.
print calc_probability(data, 1, 'x', 0, 'e') == 1949/4209.
We're going to precalculate our probability table to save us from computing the same probabilities over and over again. The output is a list with one entry per column: the first entry maps each class label to its prior probability, and each subsequent entry maps each attribute value to a dictionary of conditional probabilities, one per class label. We pass in the domain values because cross-validation splits the dataset, and some domain values may not show up in a given training set.
In [12]:
def get_probability_table(data, domains):
probs = []
yvals = domains[0]
probs.append({value : calc_probability(data, 0, value) for value in domains[0]})
for i in range(1,len(data[0])):
probs.append ({value : {yvalue : calc_probability(data, i, value, 0, yvalue) for yvalue in yvals}
for value in domains[i]})
return probs
In [13]:
domains = [get_domain_values(data, i) for i in xrange(len(data[0]))]
%time probs = get_probability_table(data, domains)
pprint(probs[0])
pprint(probs[1])
This function will calculate the unnormalized probability score of a specific class label for an example. The formula is as follows:
$P(c | a_1, \ldots, a_n) \propto P(c) \cdot \prod_{i} P(a_i = v_i | c)$
where $c$, $a_i$ and $v_i$ are the class label, the $i$-th attribute and its observed value respectively.
In [14]:
def probability_of(probs, instance, label):
    # prior for the label times the product of each attribute value's conditional probability
    conditionals = np.array([probs[i+1][instance[i]][label] for i in xrange(len(instance))])
    return probs[0][label] * np.product(conditionals)
We will need to normalize the results so that they add up to 1.
In [15]:
def normalize(results, labels):
total = np.sum(np.array([results[label] for label in labels]))
r = {label : results[label]/total for label in labels}
return r
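A quick sanity check with made-up values (not from the dataset):
print normalize({'e': 3.0, 'p': 1.0}, ['e', 'p']) == {'e': 0.75, 'p': 0.25}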
We will need to determine the best label from the probabilities returned. This is as simple as taking the label with the highest probability.
In [16]:
def find_best(results, labels):
tmp = [(results[label], label) for label in labels]
return sorted(tmp, reverse=True)[0][1]
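And a similar check for find_best, again with made-up probabilities:
print find_best({'e': 0.75, 'p': 0.25}, ['e', 'p']) == 'e'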
This is our simple Naive Bayes Classification function. It calculates the probabilities of every label for the given example and returns the best one (along with all the results).
In [17]:
def classify(probs, instance, labels=['e', 'p']):
results = {label : probability_of(probs, instance, label) for label in labels}
results = normalize(results, labels)
best = find_best(results, labels)
return (best, results)
In [18]:
i = np.random.randint(0, len(data))
print i, data[i][0]
classify(probs, data[i][1:])
Out[18]:
This function partitions the dataset for k-fold cross-validation: the selected fold becomes the test set and the remaining folds are concatenated into the training set.
In [19]:
def partition(dataset, fold, k):
segments = np.array_split(dataset, k)
test = segments[fold-1]
training = [segments[i] for i in xrange(k) if i != (fold-1)]
return np.concatenate(training), test
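A quick check, using the already-loaded data, that a fold plus the remaining folds covers the whole dataset:
training, test = partition(data, 1, 10)
print len(training) + len(test) == len(data)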
Now we need to split the data into the attribute columns and the label column.
In [20]:
def split_data(dataset):
x = dataset[:,1:]
y = dataset[:,0]
return x, y
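Another quick check: the split should preserve the number of rows and drop exactly one column (the label):
x, y = split_data(data)
print x.shape[1] == data.shape[1] - 1
print len(x) == len(y) == len(data)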
This is our cross-validation function. It partitions the dataset into training and test sets, builds the probability table from the training set, and then classifies every instance in the test set. The output is a list of (truth, predictions) pairs, one per fold, which will be used later for statistical analysis.
In [21]:
def cross_validation(dataset, folds=10):
    output = []
    # compute domain values from the full dataset so values missing from a training fold still get a probability
    domains = [get_domain_values(dataset, i) for i in xrange(len(dataset[0]))]
    for i in xrange(1, folds+1):
        training_set, test_set = partition(dataset, i, folds)
        probs = get_probability_table(training_set, domains)
        x, y = split_data(test_set)
        # classify returns (best label, distribution); keep both for later analysis
        y_h = np.array([classify(probs, xval) for xval in x])
        output.append((y, y_h))
    return output
We'll need a way to print out our confusion matrix.
In [22]:
def confusion_matrix(y, y_h, pos, neg):
    # y is the truth, y_h is the prediction
    tp = np.logical_and((y == pos), (y_h == pos)).sum()
    tn = np.logical_and((y == neg), (y_h == neg)).sum()
    # a false positive is predicted positive but actually negative; a false negative is the reverse
    fp = np.logical_and((y == neg), (y_h == pos)).sum()
    fn = np.logical_and((y == pos), (y_h == neg)).sum()
    return tp, tn, fp, fn
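A small check with toy labels (made up, not from the dataset); each of the four cells should come out to 1:
y_true = np.array(['p', 'p', 'e', 'e'])
y_pred = np.array(['p', 'e', 'e', 'p'])
print confusion_matrix(y_true, y_pred, 'p', 'e') == (1, 1, 1, 1)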
In [23]:
def print_confusion_matrix(tp, tn, fp, fn, pos, neg):
    # rows are predicted labels, columns are actual labels
    html = '''
    <table>
    <tr><td> </td><td>{0}</td><td>{1}</td></tr>
    <tr><td>{2}</td><td>{3}</td><td>{4}</td></tr>
    <tr><td>{5}</td><td>{6}</td><td>{7}</td></tr>
    </table>
    '''.format(pos, neg, pos, tp, fp, neg, fn, tn)
    h = HTML(html)
    return h
Along with the confusion matrix we will print out some stats.
In [24]:
def print_test_stats(total, tp, tn, fp, fn):
accuracy = (tp + tn)/float(total)
error = (fn + fp)/float(total)
precision = float(tp)/(tp + fp)
recall = float(tp)/(tp + fn)
classified = tp + tn + fp + fn
print "total:", total
print "total classified: {0} ({1:.3%})".format(classified, classified/float(total))
print "total unclassified: {0} ({1:.3%})".format(total - classified, (total - classified)/float(total))
print "accuracy: {0:.3%}".format(accuracy)
print "error: {0:.3%}".format(error)
print "precision: {0:.3%}".format(precision)
print "recall: {0:.3%}".format(recall)
Now we'll do our 10-fold cross-validation test.
In [25]:
dataset = np.loadtxt('agaricus-lepiota.data', delimiter=',', dtype='S1')
np.random.shuffle(dataset)
%time output = cross_validation(dataset)
and print out the results.
In [26]:
cm = np.array([confusion_matrix(y, y_h[:,0], 'p', 'e') for y, y_h in output])
tp, tn, fp, fn = cm.sum(axis=0)
total = 0
for _, y_h in output:
total += len(y_h)
print_test_stats(total, tp, tn, fp, fn)
print_confusion_matrix(tp, tn, fp, fn, 'p', 'e')
Out[26]:
Unlike decision trees, Naive Bayes classification uses probability theory to determine classification labels. It is called "naive" because it assumes that all attributes are conditionally independent given the class, which in reality is not the case. Even with this assumption it still performs quite well, as seen above.
I decided to take a stab at using a Random Forest implementation to see if I could reduce the error rate. Besides the fact that a 100-tree forest took 17 minutes to run, the results showed no improvement whatsoever. I presume this is because the probability distribution is roughly the same for each tree in the forest, in which case it makes sense that the forest showed no improvement.